Second-order Temporal Pooling for Action Recognition
نویسندگان
چکیده
Most successful deep learning models for action recognition generate predictions for short video clips, which are later aggregated into a longer time-frame action descriptor by computing a statistic over these predictions. Zeroth (max) or first order (average) statistic are commonly used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel end-to-end learnable action pooling scheme temporal correlation pooling that generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of per-frame CNN features across the video. Such a descriptor, while being computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than their firstorder counterparts. We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space. We provide experiments on four standard and fine-grained action recognition datasets. Our results clearly demonstrate the advantages of higher-order pooling schemes, achieving state-of-the-art performance.
منابع مشابه
Order-aware Convolutional Pooling for Video Based Action Recognition
Most video based action recognition approaches create the video-level representation by temporally pooling the features extracted at each frame. The pooling methods that they adopt, however, usually completely or partially neglect the dynamic information contained in the temporal domain, which may undermine the discriminative power of the resulting video representation since the video sequence ...
متن کاملNon-Linear Temporal Subspace Representations for Activity Recognition
Representations that can compactly and effectively capture the temporal evolution of semantic content are important to computer vision and machine learning algorithms that operate on multi-variate time-series data. We investigate such representations motivated by the task of human action recognition. Here each data instance is encoded by a multivariate feature (such as via a deep CNN) where act...
متن کاملPooling the Convolutional Layers in Deep ConvNets for Action Recognition
Deep ConvNets have shown its good performance in image classification tasks. However it still remains as a problem in deep video representation for action recognition. The problem comes from two aspects: on one hand, current video ConvNets are relatively shallow compared with image ConvNets, which limits its capability of capturing the complex video action information; on the other hand, tempor...
متن کاملSequence Summarization Using Order-constrained Kernelized Feature Subspaces
Representations that can compactly and effectively capture temporal evolution of semantic content are important to machine learning algorithms that operate on multi-variate time-series data. We investigate such representations motivated by the task of human action recognition. Here each data instance is encoded by a multivariate feature (such as via a deep CNN) where action dynamics are charact...
متن کاملFace Identification with Second-Order Pooling
Automatic face recognition has received significant performance improvement by developing specialised facial image representations. On the other hand, generic object recognition has rarely been applied to the face recognition. Spatial pyramid pooling of features encoded by an over-complete dictionary has been the key component of many state-of-the-art image classification systems. Inspired by i...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1704.06925 شماره
صفحات -
تاریخ انتشار 2017